{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning (Pandas & Sklearn)\n",
    "\n",
    "* [Pandas](https://pandas.pydata.org/): Python Data Analysis Library\n",
    "    * Pandas is usually used for data reading and preprocessing\n",
    "    * `pip install pandas`\n",
    "    * Tutorial: <https://pandas.pydata.org/docs/getting_started/index.html>\n",
    "    * Cheat sheet: <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>\n",
    "* [Scikit-learn](http://scikit-learn.org/) (sklearn)\n",
    "    * Scikit-learn package is used for various machine learning algorithms, which range from classification, regression to clustering\n",
    "    * `pip install sklearn`\n",
    "    * Tutorial: <https://scikit-learn.org/stable/getting_started.html>\n",
    "    * API Reference: <https://scikit-learn.org/stable/modules/classes.html>\n",
    "    * Cheat sheet: <https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf>\n",
    "    \n",
    "In this demo, we use [UCI Machine Learning dataset](https://archive.ics.uci.edu/ml/datasets.php). Specifically, we use [Automobile](https://archive.ics.uci.edu/ml/datasets/Automobile) dataset. Given three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars, and predict the price of the cars."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import urllib.request\n",
    "\n",
    "print('Begin downloading automobile dataset...')\n",
    "\n",
    "# We use UCI Machine Learning dataset - Automobile here\n",
    "data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'\n",
    "description = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names'\n",
    "if not os.path.isfile(\"automobile.data\"):\n",
    "    urllib.request.urlretrieve(data_url, 'automobile.data')\n",
    "    urllib.request.urlretrieve(description, 'automobile.names')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Attribute Information\n",
    "\n",
    "|     Attribute             |  Attribute Range |\n",
    "| :---: | :---: |\n",
    "|  1. symboling             |  -3, -2, -1, 0, 1, 2, 3 |\n",
    "|  2. normalized-losses     |  continuous from 65 to 256 |\n",
    "|  3. make                  |  alfa-romero, audi, bmw, chevrolet, dodge, honda |\n",
    "|  ...                       |  isuzu, jaguar, mazda, mercedes-benz, mercury |\n",
    "|  ...                       |  mitsubishi, nissan, peugot, plymouth, porsche |\n",
    "|  ...                       |  renault, saab, subaru, toyota, volkswagen, volv |\n",
    "|  4. fuel-type             |  diesel, gas |\n",
    "|  5. aspiration            |  std, turbo |\n",
    "|  6. num-of-doors          |  four, two |\n",
    "|  7. body-style            |  hardtop, wagon, sedan, hatchback, convertible |\n",
    "|  8. drive-wheels          |  4wd, fwd, rwd |\n",
    "|  9. engine-location       |  front, rear |\n",
    "| 10. wheel-base            |  continuous from 86.6 120.9 |\n",
    "| 11. length                |  continuous from 141.1 to 208.1 |\n",
    "| 12. width                 |  continuous from 60.3 to 72.3 |\n",
    "| 13. height                |  continuous from 47.8 to 59.8 |\n",
    "| 14. curb-weight           |  continuous from 1488 to 4066 |\n",
    "| 15. engine-type           |  dohc, dohcv, l, ohc, ohcf, ohcv, rotor |\n",
    "| 16. num-of-cylinders      |  eight, five, four, six, three, twelve, two |\n",
    "| 17. engine-size           |  continuous from 61 to 326 |\n",
    "| 18. fuel-system           |  1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi |\n",
    "| 19. bore                  |  continuous from 2.54 to 3.94 |\n",
    "| 20. stroke                |  continuous from 2.07 to 4.17 |\n",
    "| 21. compression-ratio     |  continuous from 7 to 23 |\n",
    "| 22. horsepower            |  continuous from 48 to 288 |\n",
    "| 23. peak-rpm              |  continuous from 4150 to 6600 |\n",
    "| 24. city-mpg              |  continuous from 13 to 49 |\n",
    "| 25. highway-mpg           |  continuous from 16 to 54 |\n",
    "| 26. price                 |  continuous from 5118 to 45400 |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read dataset\n",
    "Commonly in `csv` format (i.e. items separated by `,`)\n",
    "\n",
    "See [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "attr = [\"symboling\", \"normalized-losses\", \"make\", \"fuel-type\", \"aspiration\", \"num-of-doors\", \"body-style\", \"drive-wheels\", \"engine-location\", \"wheel-base\", \"length\", \"width\", \"height\", \"curb-weight\", \"engine-type\", \"num-of-cylinders\", \"engine-size\", \"fuel-system\", \"bore\", \"stroke\", \"compression-ratio\", \"horsepower\", \"peak-rpm\", \"city-mpg\", \"highway-mpg\", \"price\"]\n",
    "data = pd.read_csv(\"automobile.data\",names=attr)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Different types of data\n",
    "Ref: <https://towardsdatascience.com/data-types-in-statistics-347e152e8bee>\n",
    "\n",
    "* Categorical\n",
    "    * Nominal: No order, e.g. sex\n",
    "    * Ordinal: With order, e.g. education\n",
    "\n",
    "* Numerical\n",
    "    * Discrete: Can't be measured but can be counted, e.g. # of times doing sth.\n",
    "    * Continuous: Can't be counted but can be measured, e.g. temperature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing data\n",
    "We observe that this dataset exists lots of `?`, which means data is lost\n",
    "* Use [`data.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) to find out NaN\n",
    "* But we should firstly replace `?` with NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.replace(\"?\", np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To deal with missing data, we can\n",
    "* Delete rows with missing item (maybe most of the data are deleted)\n",
    "* Fill with **means / modes / maximums / other meaningful metrics**\n",
    "\n",
    "The following only gives a naive method.\n",
    "\n",
    "In practice, you **should** use different metrics for different types of attributes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modes = data.mode().iloc[0]\n",
    "data.fillna(modes,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Separate data and label\n",
    "We use first several features (X) to predict price (Y, the last column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = data[\"price\"]\n",
    "data.drop([\"price\"],axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Change text data (categorical) into number\n",
    "* Use number to denote different catalogs\n",
    "* Change categorical features into [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_attr = [\"make\",\"fuel-type\",\"aspiration\",\"num-of-doors\",\"body-style\",\"drive-wheels\",\"engine-location\",\"engine-type\",\"num-of-cylinders\",\"fuel-system\"]\n",
    "for a in cat_attr:\n",
    "    data[a] = pd.Categorical(data[a]) # change column type to categorical\n",
    "    dummies = pd.get_dummies(data[a],prefix=\"{}_category\".format(a))\n",
    "    data = pd.concat([data,dummies],axis=1)\n",
    "data.drop(cat_attr,axis=1,inplace=True)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature selection\n",
    "* Variance\n",
    "* Pearson correlation $R$\n",
    "* $\\chi^2$ test\n",
    "\n",
    "### Dimensionality reduction\n",
    "* Principle Components Analysis (PCA)\n",
    "* Linear Discriminant Analysis (LDA)\n",
    "\n",
    "For more methods, please see <https://www.zhihu.com/question/29316149/answer/110159647>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Separate train and test data\n",
    "Since no test/validation data are available, we manually separate the data into train data and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_size = int(len(data) * 0.8)\n",
    "X_train = data[:train_size]\n",
    "y_train = y[:train_size].to_numpy().astype(np.float64)\n",
    "X_test = data[train_size:]\n",
    "y_test = y[train_size:].to_numpy().astype(np.float64)\n",
    "print(\"Train size: {}\".format(len(X_train)))\n",
    "print(\"Test size: {}\".format(len(X_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "from sklearn.cross_validation import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(data, y, random_state=3)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scaling\n",
    "* Normalization: $x'=\\frac{x-\\bar{x}}{\\sigma}$\n",
    "* MinMaxScaling: $x'=\\frac{x-\\min}{\\max-\\min}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "scaler = StandardScaler().fit(X_train)\n",
    "X_train = scaler.transform(X_train)\n",
    "X_test = scaler.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training & Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# create model\n",
    "lr = LinearRegression(normalize=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model fitting\n",
    "lr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prediction\n",
    "y_pred = lr.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluate\n",
    "from sklearn.metrics import mean_squared_error\n",
    "mean_squared_error(y_test,y_pred)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
